2  Preparing Data for Tables

Now that we’ve covered the main types of variables, we can start thinking about how to prepare and visually represent data with figures and tables. In this section, we will focus on preparing data for tables.

In R, datasets are stored as dataframes, which are objects made up of rows and columns. Each row corresponds to an individual observation, while each column represents the variable that is being measured. There are several ways R stores variables: character for text, factor for categorical, integer for whole numbers, numeric for decimals, and logical for booleans (TRUE/FALSE).

Cleaning data can be the most time-consuming, yet arguably the most important, part of analysing data! Generally, it is

To understand R, we will examine these concepts using real-world datasets.

NAPLAN

Photo by N. K. on Unsplash

The first dataset to be explored is the NAPLAN dataset. NAPLAN stands for the National Assessment Program - Literacy and Numeracy, and is an annual national assessment for all students in year 3, 5, 7, and 9. The assessment is designed to evaluate whether students are building essential literacy and numeracy skills (https://www.nap.edu.au/naplan/faqs/naplan–general).

To load in the dataset, we can use the read.csv() function.

naplan <- read.csv("Data-sets/naplan_reading.csv")

Now that we have the dataset loaded in, let’s take a look at the variables in the dataset using str().

str(naplan)
'data.frame':   3000 obs. of  11 variables:
 $ student_id          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ school_id           : int  51 26 14 21 33 26 37 41 8 60 ...
 $ grade               : chr  "Year 7" "Year 3" "Year 7" "Year 3" ...
 $ reading_time_home   : int  70 70 30 0 20 25 5 90 45 10 ...
 $ parent_education    : chr  "Certificate/Diploma" "Year 12" "Bachelor degree" "Bachelor degree" ...
 $ school_type         : chr  "Government" "Government" "Government" "Independent" ...
 $ gender              : chr  "Female" "Female" "Male" "Female" ...
 $ birth_months        : chr  "Aug" "Apr" "Jul" "Sep" ...
 $ n_siblings          : int  2 1 1 1 0 1 3 2 1 3 ...
 $ ses_index           : num  0.87 0.11 1.14 0.66 0.27 -0.95 -0.65 1.01 -0.6 -1.14 ...
 $ naplan_reading_score: int  724 439 631 491 635 592 449 484 730 528 ...

In the top line, we can see that there are 3000 observations and 11 variables.

We can confirm this using the dim() function

dim(naplan)
[1] 3000   11

Using the names() function, we can see the variable names.

names(naplan)
 [1] "student_id"           "school_id"            "grade"               
 [4] "reading_time_home"    "parent_education"     "school_type"         
 [7] "gender"               "birth_months"         "n_siblings"          
[10] "ses_index"            "naplan_reading_score"

The first column gives us the variables, the second column gives us the type of variable, and the third column gives us individual observations for each row. However, we can see school_type is stored as a chr or character (text) variable. Text variables are very common when you encounter surveys that have customer feedback, for example, as each response will be unique from one another.

It is important to check that your variables are stored correctly. We can see certain variables like reading time at home is an integer, which matches our expectations.

You can use the unique() function to see the unique groupings or labels within a categorical variable. In our case, we can use it to confirm if our suspicions that school_type is categorical. To call a specific variable, we call the dataset followed by $ and then the variable name.

unique(naplan$school_type)
[1] "Government"  "Independent" "Catholic"   

There are only 3 unique labels, which strongly suggests that school type is a nominal categorical variable. If it were truly a text variable, we might see thousands of unique labels.

We can manually convert the variable to categorical using as.factor():

naplan$school_type <- as.factor(naplan$school_type)
str(naplan)
'data.frame':   3000 obs. of  11 variables:
 $ student_id          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ school_id           : int  51 26 14 21 33 26 37 41 8 60 ...
 $ grade               : chr  "Year 7" "Year 3" "Year 7" "Year 3" ...
 $ reading_time_home   : int  70 70 30 0 20 25 5 90 45 10 ...
 $ parent_education    : chr  "Certificate/Diploma" "Year 12" "Bachelor degree" "Bachelor degree" ...
 $ school_type         : Factor w/ 3 levels "Catholic","Government",..: 2 2 2 3 2 2 2 2 2 2 ...
 $ gender              : chr  "Female" "Female" "Male" "Female" ...
 $ birth_months        : chr  "Aug" "Apr" "Jul" "Sep" ...
 $ n_siblings          : int  2 1 1 1 0 1 3 2 1 3 ...
 $ ses_index           : num  0.87 0.11 1.14 0.66 0.27 -0.95 -0.65 1.01 -0.6 -1.14 ...
 $ naplan_reading_score: int  724 439 631 491 635 592 449 484 730 528 ...

Now we can see school_type is a factor with 3 levels (or 3 unique labels).

Alternatively, we can also convert all strings (text) to categorical variables with the stringsAsFactors argument when we read in the dataset.

naplan <- read.csv("Data-sets/naplan_reading.csv",
                   stringsAsFactors = TRUE)
str(naplan)
'data.frame':   3000 obs. of  11 variables:
 $ student_id          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ school_id           : int  51 26 14 21 33 26 37 41 8 60 ...
 $ grade               : Factor w/ 4 levels "Year 3","Year 5",..: 3 1 3 1 4 4 1 3 4 3 ...
 $ reading_time_home   : int  70 70 30 0 20 25 5 90 45 10 ...
 $ parent_education    : Factor w/ 5 levels "Bachelor degree",..: 2 5 1 1 5 4 4 1 4 4 ...
 $ school_type         : Factor w/ 3 levels "Catholic","Government",..: 2 2 2 3 2 2 2 2 2 2 ...
 $ gender              : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 1 1 1 2 ...
 $ birth_months        : Factor w/ 12 levels "Apr","Aug","Dec",..: 2 1 6 12 3 10 2 12 11 8 ...
 $ n_siblings          : int  2 1 1 1 0 1 3 2 1 3 ...
 $ ses_index           : num  0.87 0.11 1.14 0.66 0.27 -0.95 -0.65 1.01 -0.6 -1.14 ...
 $ naplan_reading_score: int  724 439 631 491 635 592 449 484 730 528 ...

As seen above, all the chr variables have been converted to Factor, which is how the variables should be stored. Note that there might be times where you need a variable stored as chr, so you should always be sure of what you’re checking.

We can now attach the dataset. Attaching the dataset means we can call variables directly without using $. For instance, we can call school_type directly as opposed to naplan$school_type.

attach(naplan)

If we type naplan directly in, we can see the content.

naplan

Alternatively, we can use the View() function to open up a new tab in RStudio with the dataset.

View(naplan)

We can also select parts of the dataframe. In R, we subset by calling the dataset followed by [,] . Writing a blank space before the comma indicates that all rows are includes, while a blank space after the comma indicates that all columns are included. For example, [,3] selects the third column of a dataframe. If we want to select the first three columns, we can write:

# Select the first three rows and three columns of the Naplan dataset
naplan[1:3,1:3]
  student_id school_id  grade
1          1        51 Year 7
2          2        26 Year 3
3          3        14 Year 7

Perhaps we only want to select the middle rows for all the columns of the dataframe.

# Select the fifth and sixth row of the Naplan dataset with 
naplan[5:6,]
  student_id school_id  grade reading_time_home parent_education school_type
5          5        33 Year 9                20          Year 12  Government
6          6        26 Year 9                25 Year 10 or below  Government
  gender birth_months n_siblings ses_index naplan_reading_score
5   Male          Dec          0      0.27                  635
6 Female          Nov          1     -0.95                  592

We can even select rows based on logical tests on the values of one or more variables.

# Select all rows (observations) that have a reading time at 
# home greater than 30 mins and a naplan reading score greater than 800
naplan[reading_time_home > 30 & naplan_reading_score > 860,]
     student_id school_id  grade reading_time_home    parent_education
493         493         1 Year 9                65             Year 12
584         584         1 Year 9                40 Certificate/Diploma
741         741        42 Year 9               105             Year 12
1118       1118        42 Year 9                95 Certificate/Diploma
1168       1168        31 Year 9                60        Postgraduate
1174       1174        35 Year 9                65             Year 12
1447       1447        23 Year 9                75             Year 12
1792       1792         5 Year 9                70             Year 12
1906       1906         5 Year 9                70    Year 10 or below
1952       1952        35 Year 9                75             Year 12
2122       2122         5 Year 9                55        Postgraduate
2359       2359        35 Year 9                60     Bachelor degree
2425       2425        47 Year 9                50        Postgraduate
     school_type gender birth_months n_siblings ses_index naplan_reading_score
493  Independent   Male          Aug          0      1.05                  900
584  Independent Female          Sep          1      0.68                  900
741   Government   Male          Aug          5     -0.44                  900
1118  Government   Male          Jul          2     -0.28                  883
1168    Catholic Female          Aug          2      2.19                  871
1174  Government   Male          Aug          2     -0.33                  900
1447 Independent   Male          Sep          2      0.30                  900
1792  Government   Male          Aug          1      0.52                  900
1906  Government   Male          Jan          5     -0.71                  887
1952  Government   Male          Aug          1     -0.17                  900
2122  Government   Male          Jul          3      1.87                  900
2359  Government Female          Dec          5      1.33                  900
2425 Independent Female          Aug          2      2.02                  866

We can also sort the rows and the columns in the dataset. By default, R will sort in ascending order.

# Order the naplan reading score (ascending)
naplan[order(naplan$naplan_reading_score), ][1:10, ]
     student_id school_id  grade reading_time_home    parent_education
2511       2511        13 Year 3                85             Year 12
2495       2495        18 Year 3                70             Year 12
1158       1158        57 Year 3                60        Postgraduate
1053       1053        18 Year 3                70        Postgraduate
617         617        13 Year 7               100 Certificate/Diploma
2349       2349        18 Year 3                80             Year 12
1207       1207        18 Year 3                50    Year 10 or below
138         138        57 Year 3                35             Year 12
1697       1697        18 Year 5                55    Year 10 or below
467         467         9 Year 3                50    Year 10 or below
     school_type gender birth_months n_siblings ses_index naplan_reading_score
2511 Independent Female          Dec          3      0.67                  177
2495  Government Female          Jan          2     -0.15                  270
1158    Catholic Female          Jan          1      1.80                  273
1053  Government Female          Sep          0      1.99                  288
617  Independent   Male          Aug          1      1.45                  300
2349  Government   Male          Aug          1     -0.10                  303
1207  Government Female          Jan          2     -2.15                  309
138     Catholic Female          Mar          2      0.24                  312
1697  Government   Male          Oct          2     -1.12                  312
467     Catholic   Male          Aug          2     -0.45                  315

We can also sort and select odd-numbered columns.

# Order the naplan reading score (ascending) 
# and select only odd-numbered columns
naplan[order(naplan_reading_score), c(1, 3, 5, 7, 9, 11)][1:3,]
     student_id  grade parent_education gender n_siblings naplan_reading_score
2511       2511 Year 3          Year 12 Female          3                  177
2495       2495 Year 3          Year 12 Female          2                  270
1158       1158 Year 3     Postgraduate Female          1                  273

Or descending order.

# Order the naplan reading score (descending)
# and select only odd-numbered columns
naplan[rev(order(naplan_reading_score)), c(1, 3, 5, 7, 9, 11)][1:3,]
     student_id  grade parent_education gender n_siblings naplan_reading_score
2359       2359 Year 9  Bachelor degree Female          5                  900
2122       2122 Year 9     Postgraduate   Male          3                  900
1952       1952 Year 9          Year 12   Male          1                  900

We can summarise the content of the dataframe as well.

summary(naplan)
   student_id       school_id        grade     reading_time_home
 Min.   :   1.0   Min.   : 1.00   Year 3:735   Min.   :  0.00   
 1st Qu.: 750.8   1st Qu.:18.00   Year 5:771   1st Qu.:  5.00   
 Median :1500.5   Median :32.00   Year 7:745   Median : 15.00   
 Mean   :1500.5   Mean   :31.90   Year 9:749   Mean   : 21.62   
 3rd Qu.:2250.2   3rd Qu.:45.25                3rd Qu.: 30.00   
 Max.   :3000.0   Max.   :60.00                Max.   :120.00   
                                                                
            parent_education      school_type      gender      birth_months 
 Bachelor degree    :771     Catholic   : 540   Female:1451   Dec    : 325  
 Certificate/Diploma:701     Government :2023   Male  :1549   Aug    : 314  
 Postgraduate       :297     Independent: 437                 Jan    : 300  
 Year 10 or below   :466                                      Sep    : 299  
 Year 12            :765                                      Jul    : 296  
                                                              Nov    : 296  
                                                              (Other):1170  
   n_siblings      ses_index       naplan_reading_score
 Min.   :0.000   Min.   :-2.3200   Min.   :177.0       
 1st Qu.:1.000   1st Qu.:-0.1100   1st Qu.:491.0       
 Median :2.000   Median : 0.5300   Median :562.0       
 Mean   :1.899   Mean   : 0.5246   Mean   :558.5       
 3rd Qu.:3.000   3rd Qu.: 1.1500   3rd Qu.:618.2       
 Max.   :5.000   Max.   : 3.4100   Max.   :900.0       
                                                       

You might notice that the summary doesn’t quite work for categorical variables.

We can use tapply() and with() to summarise a quantitative variable by a qualitative variable.

with(naplan, tapply(naplan_reading_score, grade, mean))
  Year 3   Year 5   Year 7   Year 9 
457.1687 539.7821 594.5785 641.1722 

aggregate() allows you to summarise by one or more categorical variables, while tapply can only summarise one variable at a time.

aggregate(naplan[,c(4,11)], list(grade),mean)
  Group.1 reading_time_home naplan_reading_score
1  Year 3          21.59184             457.1687
2  Year 5          21.83528             539.7821
3  Year 7          22.24832             594.5785
4  Year 9          20.80774             641.1722
aggregate(naplan[,c(4,11)], list(Grade = grade),mean)
   Grade reading_time_home naplan_reading_score
1 Year 3          21.59184             457.1687
2 Year 5          21.83528             539.7821
3 Year 7          22.24832             594.5785
4 Year 9          20.80774             641.1722
aggregate(naplan[,c(4,11)], list(Grade = grade, "Parent Education" = parent_education),mean)
    Grade    Parent Education reading_time_home naplan_reading_score
1  Year 3     Bachelor degree          21.35000             478.5500
2  Year 5     Bachelor degree          19.63054             564.7783
3  Year 7     Bachelor degree          21.15591             618.3065
4  Year 9     Bachelor degree          18.48901             653.3187
5  Year 3 Certificate/Diploma          18.17919             457.6127
6  Year 5 Certificate/Diploma          21.78977             536.1705
7  Year 7 Certificate/Diploma          23.15217             595.2554
8  Year 9 Certificate/Diploma          22.44048             642.1786
9  Year 3        Postgraduate          20.06494             487.3117
10 Year 5        Postgraduate          20.19481             575.4026
11 Year 7        Postgraduate          23.10606             628.7727
12 Year 9        Postgraduate          17.20779             684.5325
13 Year 3    Year 10 or below          22.47475             431.2323
14 Year 5    Year 10 or below          25.49587             505.1983
15 Year 7    Year 10 or below          21.30435             558.1391
16 Year 9    Year 10 or below          21.48855             609.4809
17 Year 3             Year 12          25.18817             435.0914
18 Year 5             Year 12          22.55155             524.3351
19 Year 7             Year 12          22.70619             581.1546
20 Year 9             Year 12          22.56545             632.9686

Exercise: Palmer Penguins

Photo by Derek Oyen on Unsplash

First, we will need to load in some data into R. We will be working with the palmerpenguins package. The palmerpenguins package is a dataset collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

If you have not used this package before, you will need to install it first.

install.packages("palmerpenguins")

Once installed, you will need to load the package into R.

# Load in the `palmerpenguins` package library(palmerpenguins)
# Load in the `palmerpenguins` package
library(palmerpenguins)

Now that we have the package installed, let’s take a look at the variables in the dataset using str() on the penguins dataset.

# Call str() on `penguins` str(penguins)
# Call str() on `penguins`
str(penguins)

You might notice that some values are listed as NA. This means there are missing observations. To check for missing values, try using the embedded any(is.na()) function.

# Try using any(is.na()) on the `penguins` dataset. any(is.na(penguins))
# Try using any(is.na()) on the `penguins` dataset.
any(is.na(penguins))

Since this returns TRUE, we know there are missing values in the dataset. Let’s find out how many.

# Try using sum(is.na()) on the `penguins` dataset. sum(is.na(penguins))
# Try using sum(is.na()) on the `penguins` dataset.
sum(is.na(penguins))

There are missing observations. This can cause problems when generating numerical summaries or running statistical tests. At this stage, to handle this, we can remove the missing values. It’s always best practice to store your cleaned data in a new dataset and leave the original untouched. Note that you should always be careful when removing observations, as even partially filled observations can still provide valuable information.

clean_penguins <- na.omit(penguins)
clean_penguins <- na.omit(penguins)
clean_penguins <- na.omit(penguins)

We can check the structure again to make sure everything looks good:

str(clean_penguins)
tibble [333 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
 $ bill_depth_mm    : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
 $ flipper_length_mm: int [1:333] 181 186 195 193 190 181 195 182 191 198 ...
 $ body_mass_g      : int [1:333] 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
 $ year             : int [1:333] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
 - attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 179 219 257 269 ...
  ..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...

We can see that there are eight variables. Three of these are categorical: species, island, and sex. These are stored as factors, where the levels represent the number of distinct labels or groups within each variable. For example, sex has two levels: male and female.

You’ll notice that there are four numerical variables: bill_length_mm, bill_depth_mm, flipper_length_mm, and year. If your instinct is that year doesn’t quite fit as right as a numerical variable, you would be right! It is better to treat year as a categorical variable.

How do we manually convert year to a factor?

Now if we run class(), we can see that year is now correctly stored as a categorical variable with three levels.

class(clean_penguins$year)
[1] "factor"

How can we check the unique groupings or labels within year?

unique(clean_penguins$year)
unique(clean_penguins$year)

This shows that the data was collected over the years , , and (ascending order).

You can also see from the structure output str() that the penguins dataset is stored as a 333 x 8 tibble, which is a type of data frame in R. This means that there are 333 rows (observations) and 8 columns (variables).

Check the dimensions of the cleaned dataset to confirm this.

dim(clean_penguins)
dim(clean_penguins)

If you wish to have a closer look at inspecting the penguins dataset, you can use the View() function to open it in a new tab on RStudio.

View(penguins)
View(clean_penguins)

Tables

Now that we’ve inspected the dataset, we can start thinking about how to visually represent and understand the data.

As we have learnt

library(gt)
library(tidyverse)
library(kableExtra)
naplan %>% 
  dplyr::filter(reading_time_home > 30 & naplan_reading_score > 800) %>%
  gt() %>%
  tab_header(title = "Naplan") %>%
  fmt_number(decimals=2) %>%
  cols_label(
    student_id = md("***Student ID***"),
    school_id = md("***School ID***"),
    grade = md("***Grade***"),
    reading_time_home = md("***Reading Time***"),
    parent_education = md("***Parent Education***"),
    school_type = md("***School Type***"),
    gender = md("***Gender***"),
    birth_months = md("***Birth Month***"),
    n_siblings = md("***Number of Siblings***"),
    ses_index = md("***SES Index***"),
    naplan_reading_score = md("***Reading Score***")
  )
Naplan

Student ID

School ID

Grade

Reading Time

Parent Education

School Type

Gender

Birth Month

Number of Siblings

SES Index

Reading Score

493.00 1.00 Year 9 65.00 Year 12 Independent Male Aug 0.00 1.05 900.00
584.00 1.00 Year 9 40.00 Certificate/Diploma Independent Female Sep 1.00 0.68 900.00
626.00 35.00 Year 9 40.00 Bachelor degree Government Female Dec 2.00 1.16 804.00
741.00 42.00 Year 9 105.00 Year 12 Government Male Aug 5.00 −0.44 900.00
1,118.00 42.00 Year 9 95.00 Certificate/Diploma Government Male Jul 2.00 −0.28 883.00
1,168.00 31.00 Year 9 60.00 Postgraduate Catholic Female Aug 2.00 2.19 871.00
1,174.00 35.00 Year 9 65.00 Year 12 Government Male Aug 2.00 −0.33 900.00
1,447.00 23.00 Year 9 75.00 Year 12 Independent Male Sep 2.00 0.30 900.00
1,792.00 5.00 Year 9 70.00 Year 12 Government Male Aug 1.00 0.52 900.00
1,906.00 5.00 Year 9 70.00 Year 10 or below Government Male Jan 5.00 −0.71 887.00
1,952.00 35.00 Year 9 75.00 Year 12 Government Male Aug 1.00 −0.17 900.00
2,122.00 5.00 Year 9 55.00 Postgraduate Government Male Jul 3.00 1.87 900.00
2,359.00 35.00 Year 9 60.00 Bachelor degree Government Female Dec 5.00 1.33 900.00
2,425.00 47.00 Year 9 50.00 Postgraduate Independent Female Aug 2.00 2.02 866.00
2,691.00 55.00 Year 9 80.00 Year 12 Government Female Dec 2.00 −0.18 837.00
2,712.00 31.00 Year 9 60.00 Year 12 Catholic Male Dec 2.00 0.81 809.00
2,756.00 53.00 Year 9 55.00 Year 10 or below Government Female Mar 1.00 −0.40 840.00
naplan %>% 
  filter(gender == "Female", school_type == "Independent") %>%
  arrange(desc(naplan_reading_score)) %>%
  slice_head(n = 10) %>%
  gt() %>%
  tab_header(title = "Naplan") %>%
  fmt_number(decimals = 2) %>%
  cols_label(
    student_id = md("***Student ID***"),
    school_id = md("***School ID***"),
    grade = md("***Grade***"),
    reading_time_home = md("***Reading Time***"),
    parent_education = md("***Parent Education***"),
    school_type = md("***School Type***"),
    gender = md("***Gender***"),
    birth_months = md("***Birth Month***"),
    n_siblings = md("***Number of Siblings***"),
    ses_index = md("***SES Index***"),
    naplan_reading_score = md("***Reading Score***")
  ) %>%
  data_color(
    columns = naplan_reading_score,
    method = "numeric",
    palette = "viridis",
    domain = c(0, 1000),
    reverse = TRUE
  ) %>%
  tab_style(
    style = list(
      cell_fill(color = "gray95"),
      cell_borders(sides = c("l", "r"), color = "gray50", weight = px(3))
    ),
    locations = cells_body(columns = -naplan_reading_score)
  ) %>%
  tab_style(
    style = cell_fill(color = "gray98"),
    locations = cells_title()
  ) %>%
  tab_style(
    style = list(cell_fill(color = "gray35"), cell_text(color = "white")),
    locations = list(cells_footnotes(), cells_source_notes())
  ) %>%
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_title(groups = "title")
  ) %>% 
  tab_style(
    style = cell_borders(
      sides = c("t", "b"),
      color = "darkgrey",
      weight = px(3)
    ),
    locations = list(cells_column_labels(), cells_stubhead())
  ) %>%
tab_footnote(
    footnote = "SES Index represents ",
    locations = cells_column_labels(columns = ses_index)
  )
Naplan

Student ID

School ID

Grade

Reading Time

Parent Education

School Type

Gender

Birth Month

Number of Siblings

SES Index

1

Reading Score

584.00 1.00 Year 9 40.00 Certificate/Diploma Independent Female Sep 1.00 0.68 900.00
2,425.00 47.00 Year 9 50.00 Postgraduate Independent Female Aug 2.00 2.02 866.00
1,007.00 1.00 Year 9 25.00 Bachelor degree Independent Female Apr 0.00 1.09 831.00
887.00 1.00 Year 7 50.00 Bachelor degree Independent Female Jul 1.00 0.85 800.00
2,166.00 2.00 Year 7 60.00 Bachelor degree Independent Female May 1.00 2.40 800.00
2,610.00 23.00 Year 9 35.00 Bachelor degree Independent Female Oct 1.00 1.44 796.00
305.00 1.00 Year 9 35.00 Certificate/Diploma Independent Female Apr 3.00 0.46 788.00
1,220.00 47.00 Year 9 35.00 Certificate/Diploma Independent Female Jul 5.00 0.76 782.00
133.00 47.00 Year 9 55.00 Year 10 or below Independent Female Sep 3.00 0.36 780.00
1,383.00 47.00 Year 9 30.00 Certificate/Diploma Independent Female Sep 2.00 1.53 774.00
1 SES Index represents
naplan |>
  dplyr::group_by(gender) |>
  dplyr::summarise(
    mean_score = mean(naplan_reading_score, na.rm = TRUE),
    mean_ses   = mean(ses_index, na.rm = TRUE),
    .groups = "drop"
  ) |>
  tidyr::pivot_wider(
    names_from = gender,
    values_from = c(mean_score, mean_ses)
  ) |>
  gt() |>
  fmt_number(columns = everything(), decimals = 2) |>
  tab_spanner(
    label = "Female",
    columns = matches("_Female$")
  ) |>
  tab_spanner(
    label = "Male",
    columns = matches("_Male$")
  ) |>
  cols_label(
    matches("mean_score") ~ "Reading Score<br>Average",
    matches("mean_ses") ~ "SES Index<br>Average",
    .fn=md
  ) |>
  cols_width(everything() ~ px(120))
Female Male

Reading Score
Average

SES Index
Average

Reading Score
Average

SES Index
Average

563.00 0.50 554.21 0.55
library(gtsummary)
naplan |>
  tbl_summary(
    by = gender,
    include=c(grade, reading_time_home, naplan_reading_score),
    statistic=list(all_continuous()~"{mean} ({sd})"),
    digits=list(naplan_reading_score = c(0,1),
                reading_time_home = c(0,1))
  )

Characteristic

Female
N = 1,451

1

Male
N = 1,549

1
grade

    Year 3 350 (24%) 385 (25%)
    Year 5 378 (26%) 393 (25%)
    Year 7 356 (25%) 389 (25%)
    Year 9 367 (25%) 382 (25%)
reading_time_home 22 (22.9) 21 (22.7)
naplan_reading_score 563 (95.7) 554 (92.0)
1

n (%); Mean (SD)

naplan |>
  tbl_summary(
    include=c(reading_time_home, naplan_reading_score),
    type=all_continuous() ~ "continuous2",
    statistic=list(all_continuous()~c("{median} ({p25}, {p75})", "{min}, {max}"))
    )

Characteristic

N = 3,000

reading_time_home
    Median (Q1, Q3) 15 (5, 30)
    Min, Max 0, 120
naplan_reading_score
    Median (Q1, Q3) 562 (491, 619)
    Min, Max 177, 900
naplan %>%
  filter(gender == "Female", school_type == "Independent") %>%
  arrange(desc(naplan_reading_score)) %>%
  slice_head(n = 10) %>%
  select(
    `Student ID` = student_id,
    `School ID` = school_id,
    Grade = grade,
    `Reading Time` = reading_time_home,
    `Parent Education` = parent_education,
    `School Type` = school_type,
    Gender = gender,
    `Birth Month` = birth_months,
    `Number of Siblings` = n_siblings,
    `SES Index` = ses_index,
    `Reading Score` = naplan_reading_score
  ) %>%
  mutate(
    `Reading Score` = cell_spec(
      `Reading Score`,
      background = spec_color(`Reading Score`, option = "viridis", end = 0.95),
      color = "white"
    )
  ) %>%
  kable(
    format = "html",
    align = "c",
    booktabs = TRUE,
    escape = FALSE,
    caption = "Naplan"
  ) %>%
  kable_styling(full_width = FALSE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "gray35") %>%
  column_spec(
    1:10,
    background = "gray95",
    extra_css = "border-left: 3px solid gray50; border-right: 3px solid gray50;"
  )  %>%  footnote(general = "SES Index represents ...", general_title = "")
Naplan
Student ID School ID Grade Reading Time Parent Education School Type Gender Birth Month Number of Siblings SES Index Reading Score
584 1 Year 9 40 Certificate/Diploma Independent Female Sep 1 0.68 900
2425 47 Year 9 50 Postgraduate Independent Female Aug 2 2.02 866
1007 1 Year 9 25 Bachelor degree Independent Female Apr 0 1.09 831
887 1 Year 7 50 Bachelor degree Independent Female Jul 1 0.85 800
2166 2 Year 7 60 Bachelor degree Independent Female May 1 2.40 800
2610 23 Year 9 35 Bachelor degree Independent Female Oct 1 1.44 796
305 1 Year 9 35 Certificate/Diploma Independent Female Apr 3 0.46 788
1220 47 Year 9 35 Certificate/Diploma Independent Female Jul 5 0.76 782
133 47 Year 9 55 Year 10 or below Independent Female Sep 3 0.36 780
1383 47 Year 9 30 Certificate/Diploma Independent Female Sep 2 1.53 774
SES Index represents ...